gitindex: replace go-git blob reading with pipelined git cat-file --batch#1021
keegancsmith merged 3 commits into sourcegraph:main
Conversation
Replace the serial go-git BlobObject calls in indexGitRepo with a single
pipelined "git cat-file --batch --buffer" subprocess. A writer goroutine
feeds all blob SHAs to stdin while the main goroutine reads responses
from stdout, forming a concurrent pipeline that eliminates per-object
packfile seek overhead and leverages git's internal delta base cache.
Submodule blobs fall back to the existing go-git createDocument path.
Benchmarked on kubernetes (29,188 files, 261 MB), Apple M1 Max, 5 runs:
go-git BlobObject (before):
Time: 2.94s Allocs: 685K Memory: 691 MB
cat-file pipelined (after):
Time: 0.60s Allocs: 58K Memory: 276 MB
Speedup: 4.9x time, 12x fewer allocs, 2.5x less memory
keegancsmith left a comment
Awesome stuff. See inline, but the reason I am requesting changes is I want to avoid reading into memory files that we decide to not index due to being too large. I believe you should be able to shape the code written here into something which allows that.
gitindex/index.go (outdated):

```go
// Build the ordered list of (name, key) pairs and collect blob SHAs for
// the main repo so we can read them all at once via git cat-file --batch.
type indexEntry struct {
	key       fileKey
	blobIndex int // index into blobResults, or -1 for submodule blobs
}
entries := make([]indexEntry, 0, totalFiles)
mainRepoIDs := make([]plumbing.Hash, 0, totalFiles)
```
I think it makes sense to rather do two loops. One over mainRepoIDs and instead have another list for submoduleIDs. The actual code is quite minimal for submoduleIDs and the duplication then between what we do once we have created a doc is minimal.
Done — split into two separate loops. mainRepoKeys iterated with the streaming reader, submoduleKeys iterated with go-git createDocument. The builder sorts documents internally via sortDocuments so ordering between the two loops does not matter.
gitindex/index.go (outdated):

```go
var blobResults []blobResult
if len(mainRepoIDs) > 0 {
	var err error
	blobResults, err = readBlobsPipelined(opts.RepoDir, mainRepoIDs)
```
I think there is a problem with this current API in that you read everything into memory. This just won't scale for very large repos. It's part of why zoekt works with this design of adding documents that at certain thresholds decide to create a shard (which then rely on the GC to free memory).
Additionally previously if a blob was over our max size, we just wouldn't read it in to memory.
I would suggest instead designing the API to be like the archive/tar API. So you would still pass in a huge slice of mainRepoIDs. But it would then return a stateful struct over the output from the process. For each file we get a little bit of metadata (like size, etc) and then the caller here can read that into a []byte or call .Next to get it to skip over the file.
So I think to keep the nice property around just passing in all IDs, we still get git to output all objects. But here in go land we decide between actually reading the object into a []byte or just io.Discarding it.
You don't have to go crazy on the design of the API. Just do something nice and minimal to suit this use case.
Agreed, reworked. readBlobsPipelined is replaced by a streaming catfileReader — archive/tar.Reader-style:
```go
cr, _ := newCatfileReader(opts.RepoDir, mainRepoIDs)
for _, key := range mainRepoKeys {
	size, missing, _ := cr.Next()
	if size > SizeMax {
		continue // unread bytes auto-discarded by next Next()
	}
	content := make([]byte, size)
	io.ReadFull(cr, content)
}
```

- `Next()` reads the header and returns the size — content is not read until the caller calls `Read`
- Calling `Next()` again without reading discards unread bytes via `bufio.Reader.Discard`
- Large blobs never touch Go memory — only the header is parsed; content is discarded in the bufio buffer
- Peak memory is now bounded by `ShardMax` (one shard's worth of content), not total repo size
Replace the bulk readBlobsPipelined (which read all blobs into a
[]blobResult slice) with a streaming catfileReader modeled after
archive/tar.Reader:
```go
cr, _ := newCatfileReader(repoDir, ids)
for {
	size, missing, err := cr.Next()
	if err == io.EOF {
		break
	}
	if size > maxSize {
		continue // auto-skipped, never read
	}
	content := make([]byte, size)
	io.ReadFull(cr, content)
}
```
Next() reads the cat-file header and returns the blob's size. The
caller decides whether to Read the content or skip it — calling Next()
again automatically discards unread bytes via bufio.Reader.Discard.
Large blobs over SizeMax are never allocated or read into Go memory.
Also split the single interleaved loop into two: one for main-repo
blobs streamed via cat-file, one for submodule blobs via go-git's
createDocument. The builder sorts documents internally so ordering
between the loops does not matter.
Peak memory is now bounded by ShardMax (one shard's worth of content)
rather than total repository size.
Updated benchmarks after the streaming rewrite (median of 5 runs, Apple M1 Max, kubernetes 29,188 files):
The streaming API is faster than the previous bulk `readBlobsPipelined`.
keegancsmith left a comment
Ok did a much deeper review now. It's looking great! Just have a few inline comments then we should be able to ship this.
```go
// Writer goroutine: feed all SHAs then close stdin to trigger flush.
writeErr := make(chan error, 1)
go func() {
	defer stdin.Close()
```
also defer close(writeErr) so if you call Close twice it won't block forever.
Done — added defer close(writeErr) in the goroutine. Also wrapped Close() itself in sync.Once (same pattern as DirectoryWatcher.Stop in search/watcher.go) so that cmd.Wait() is also called exactly once. This makes Close fully idempotent — safe to call multiple times or concurrently. Added TestCatfileReader_DoubleClose and TestCatfileReader_ConcurrentClose that verify this.
```go
}

// Close shuts down the cat-file process and waits for it to exit.
func (cr *catfileReader) Close() error {
```
I worry about what happens when we call Close before git is finished giving us all its output. eg if we get an error during builder.Add we will call close before reading everything.
Maybe instead this should just always call something like cr.cmd.Kill() (I can't remember offhand what it is). Then I think all the intermediate goroutines/etc will drain.
I do quite like that you nicely clean up here, but I want to make sure we have good behaviour for early termination.
Good call — switched to kill-first. Close() now calls cmd.Process.Kill() before draining. This matches Gitaly's pattern — their cache eviction code has an explicit comment: "We first cancel the context such that the process gets a SIGKILL signal. Calling close() first may lead to a deadlock given that it waits for the process to exit, which may not happen if it hangs writing data to stdout."
After Kill, the drain completes instantly (stdout pipe gets EOF), the writer goroutine gets EPIPE on its next write and exits, and cmd.Wait() returns. The expected "signal: killed" exit error is suppressed via isKilledErr().
Added TestCatfileReader_CloseBeforeExhausted_ManyBlobs (200 files, reads only 1, then Close) and TestCatfileReader_CloseWithoutReading to exercise this.
gitindex/catfile.go (outdated):

```go
if lastSpace == -1 {
	return 0, false, fmt.Errorf("unexpected header: %q", header)
}
size, err = strconv.ParseInt(string(header[lastSpace+1:]), 10, 64)
```
minor: let's parse this as an int instead of int64, since all call sites work with int; then you can avoid the cast to int when you do Discard.
Done — Next() now returns int, pending field is int, uses strconv.Atoi. Removed all casts at call sites (Discard, size comparison against SizeMax, make([]byte, size)).
gitindex/index.go (outdated):

```go
var doc index.Document

if missing {
	doc = skippedLargeDoc(key, branches)
```
have you come across missing in your tests? Saying we skipped it due to a large file doesn't seem correct. We should either fail or, if this has a common cause, add a new indexing skip reason for a missing document.
You're right — added SkipReasonMissing with explanation "object missing from repository". Missing blobs now get skippedDoc(key, branches, index.SkipReasonMissing) plus a warning log. This is unexpected for local repos (could indicate corruption, shallow clone, or a race with git gc), so the warning helps operators diagnose it.
Updated file_category.go to handle SkipReasonMissing the same way as TooLarge — guess category from filename, don't examine content.
The go-git createDocument path keeps using SkipReasonTooLarge for ErrObjectNotFound since that has a different semantic (large objects excluded during fetch).
gitindex/index.go (outdated):

```go
doc = skippedLargeDoc(key, branches)
} else {
	content := make([]byte, size)
	if _, err := io.ReadFull(cr, content); err != nil {
```
minor: it's unusual to repeatedly call io.ReadFull on the same reader. You might be able to expose a nicer API with the use of io.LimitedReader. But this is a very minor point; this does work!
Considered this — io.ReadFull into a pre-allocated buffer is actually more efficient here since the size is known: zero intermediate allocations vs io.ReadAll(io.LimitReader(...)) which grows a buffer internally. Added a comment explaining the choice. The catfileReader already limits reads internally via the pending field (same approach as archive/tar.Reader), so an extra LimitedReader wrapper would be redundant.
What I meant is that the catfileReader API would expose a LimitedReader rather than callers using its Read implementation directly. But yeah, this all works, so this was more just me thinking it through. For example, I think that is how the archive/tar reader API likely works.
```go
// Stream main-repo blobs via pipelined cat-file --batch --buffer.
// Large blobs are skipped without reading content into memory.
if len(mainRepoIDs) > 0 {
```
Feels a bit error prone the number of places we need to remember to do cr.Close inside of this if statement. Maybe this whole if statement body can be factored out into a function, then we can do defer cr.Close?
Done — extracted indexCatfileBlobs() with defer cr.Close(). This eliminated all four manual cr.Close() calls (three error paths + one happy path). Any new error path added in the future is automatically covered by the defer.
gitindex/index.go (outdated):

```go
for _, name := range names {
	for _, key := range fileKeys[name] {
		if key.SubRepoPath == "" {
```
I have a slight concern this breaks stuff. For a few weeks can we have an envvar to switch off this behaviour? (I guess if, when turned off, this if statement is always false, it will just work out?)
Then after a few weeks we remove the envvar. But this code path is pretty critical on the server side both at Sourcegraph and GitLab.
Done — added ZOEKT_DISABLE_CATFILE_BATCH. Follows the exact same pattern as ZOEKT_DISABLE_GOGIT_OPTIMIZATION already in this function: cmp.Or(os.Getenv(...), "false") + strconv.ParseBool. When set to true, the if key.SubRepoPath == "" condition is effectively always false, so all keys go through the go-git createDocument path. Benchmarked both paths on kubernetes — cat-file path is 4.4x faster, go-git fallback works correctly.
…issing

Address review feedback on PR sourcegraph#1021:

- Make Close() idempotent via sync.Once; kill the git process first (matching Gitaly's pattern) instead of draining all remaining stdout, so early termination is fast. Suppress the expected SIGKILL exit error. Add defer close(writeErr) in the writer goroutine to prevent deadlock on double-close.
- Change Next()'s return type and the pending field from int64 to int, using strconv.Atoi. Removes casts at all call sites; SizeMax is already int.
- Add SkipReasonMissing for blobs that git cat-file reports as missing, instead of reusing SkipReasonTooLarge. Missing is unexpected for local repos (corruption, shallow clone, gc race), so log a warning.
- Extract indexCatfileBlobs() with defer cr.Close(), eliminating four manual Close() calls on error paths.
- Add ZOEKT_DISABLE_CATFILE_BATCH env var kill switch following the existing ZOEKT_DISABLE_GOGIT_OPTIMIZATION pattern. When set, all blobs fall back to the go-git createDocument path.
- Deduplicate skippedLargeDoc/skippedMissingDoc into skippedDoc(reason).
- Add 19 hardening tests covering Close lifecycle (double close, concurrent close, early termination), Read edge cases (partial reads, 1-byte buffer, empty blobs, read-without-next), missing-object sequences, large-blob byte precision, and duplicate SHAs.

Benchmarked on kubernetes (29,188 files): no performance regression (geomean -0.89%, within noise).
Addressed all inline comments from the second review — replied to each one individually. A summary of the changes is in the hardening commit message above.
Summary

- Replaces serial go-git `BlobObject` calls in `indexGitRepo` with a single pipelined `git cat-file --batch --buffer` subprocess
- `--buffer` switches git from per-object `write_or_die` flushes to libc stdio buffering, reducing syscalls
- Submodule blobs fall back to the existing `createDocument` path

Benchmarks
On kubernetes (29,188 files, 261 MB content), Apple M1 Max, median of 5 runs:
Reproducing:
Context
Filed as #1016. seek wraps zoekt as a local CLI for AI coding agents — every search is a blocking tool call, so cold-index speed matters.
Prior art: PR #424 benchmarked `git cat-file --batch` at 2.68x faster than go-git. PR #940 states "eventually we want to move away from go-git towards a vanilla direct git usage."

Design
The pipelined reader (`readBlobsPipelined` in `gitindex/catfile.go`):

- Spawns `git cat-file --batch --buffer` with the repo as the working directory
- A writer goroutine feeds all SHAs to stdin via a `bufio.Writer`, then closes stdin to trigger git's final `fflush(stdout)`
- The main goroutine reads responses from stdout via a `bufio.Reader`, parsing the `<oid> <type> <size>\n<content>\n` protocol
- The subprocess is reaped with `cmd.Wait()`
- Missing objects are reported via `blobResult.Missing` (matches the existing `ErrObjectNotFound` → `skippedLargeDoc` behavior)
Integration in `indexGitRepo` (`gitindex/index.go`):

- Separates main-repo blob keys (`SubRepoPath == ""`) from submodule keys
- Calls `readBlobsPipelined` once
- Builds each `index.Document` from the pipelined results for main-repo blobs
- Falls back to go-git `createDocument` for submodule blobs

Per-blob allocations are minimized to 1 (the content `[]byte` itself) via stack-allocated hex buffers, `ReadBytes` for headers, and `LastIndexByte` for size parsing.

Test plan
- `go test ./gitindex/ -race` passes (all existing tests + 2 new)
- `go vet ./gitindex/` clean
- `golangci-lint run ./gitindex/` — no new issues
- New tests: `TestReadBlobsPipelined` (normal, empty, binary, large blobs), `TestReadBlobsPipelined_WithMissing`
- New benchmarks: `BenchmarkBlobRead_GoGit` (baseline), `BenchmarkBlobRead_CatfilePipelined` (production path)
- … (`torvalds/linux`)

🤖 Generated with Claude Code